Homework 08: Higher Dimensional Continuous Data

General instructions for all assignments:

  • Use this file as the template for your submission. Delete the unnecessary text (e.g., this text, the problem statements, etc.). That said, keep the nicely formatted “Problem 1”, “Problem 2”, “a.”, “b.”, etc.
  • Upload a single R Markdown file (named as: [AndrewID]-HW08.Rmd – e.g. “sventura-HW08.Rmd”) to the Homework 08 submission section on Blackboard. You do not need to upload the .html file.
  • The instructor and TAs will run your .Rmd file on their computer. If your .Rmd file does not knit on our computers, you will automatically be deducted 10 points.
  • Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file.
  • Each answer must be supported by written statements (unless otherwise specified)
  • Include the name of anyone you collaborated with at the top of the assignment
  • Include the style guide you used below under Problem 0


library(data.table)
library(ggplot2)
library(dplyr)
library(GGally)

my_theme <-  theme_bw() + # White background, black and white theme
  theme(axis.text = element_text(size = 12, color="indianred4"),
        text = element_text(size = 14, face="bold", color="darkslategrey"))

Problem 0

Organization, Themes, and HTML Output

(5 points)

  1. For all problems in this assignment, organize your output as follows:
  • Organize each part of each problem into its own tab. For Problems 2 and 5, the organization into tabs is already completed for you, giving you a template that you can use for the subsequent problems.
  • Use code folding for all code. See here for how to do this.
  • Use a floating table of contents.
  • Suppress all warning messages in your output by using warning = FALSE and message = FALSE in every code block.
  2. For all problems in this assignment, adhere to the following guidelines for your ggplot theme and use of color:
  • Don’t use both red and green in the same plot, since a large proportion of the population is red-green colorblind.
  • Try to use at most three colors in your themes. In previous assignments, many students used different colors for the axes, axis ticks, axis labels, graph titles, grid lines, background, etc. This is unnecessary and only serves to make your graphs more difficult to read. Use a more concise color scheme.
  • Make sure you use a white or gray background (preferably light gray if you use gray).
  • Make sure that titles, labels, and axes are in dark colors, so that they contrast well with the light colored background.
  • Only use color when necessary and when it enhances your graph. For example, if you have a univariate bar chart, there’s no need to color the bars different colors, since this is redundant.
  • In general, try to keep your themes (and written answers) professional. Remember, you should treat these assignments as professional reports.
  3. What style guide are you using for this assignment?


Problem 1

(3 points each)

a.

library(MASS)

ggplot(Cars93) + geom_text(aes(x = Fuel.tank.capacity, y = MPG.city, label = Model)) +
  ggtitle("Cars on Sale in the US in 1993 \n by Fuel Tank Capacity and City MPG") + 
  labs(x = "Car Fuel Tank Capacity (gal)", y = "Car MPG, City") +
  my_theme

b.

colorblind_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", 
                        "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(Cars93) + geom_text(aes(x = Fuel.tank.capacity, y = MPG.city, 
                               label = Model, color = Type), 
                           angle = 30) +
  ggtitle("Cars on Sale in the US in 1993 \n by Fuel Tank Capacity, City MPG, and Type") + 
  labs(x = "Car Fuel Tank Capacity (gal)", y = "Car MPG, City") +
  scale_color_manual(values = colorblind_palette) +
  my_theme

c.

ggplot(Cars93) + geom_text(aes(x = Fuel.tank.capacity, y = MPG.city, 
                               label = Model, color = Type, size = RPM), 
                           angle = 30, family = "mono") +
  ggtitle("Cars on Sale in the US in 1993 by \n Fuel Tank Capacity, City MPG, \n Max RPM, and Type") + 
  labs(x = "Car Fuel Tank Capacity (gal)", y = "Car MPG, City", 
       size = "RPM at Max Horsepower", color = "Type") +
  scale_color_manual(values = colorblind_palette) +
  my_theme

d.

There are 5 variables included in the graph: Fuel tank capacity, MPG in the city, the specific model, the type of car, and the maximum RPM.

Some example findings/takeaways:

  • Small cars seem to have low fuel tank capacity and high city MPG compared to other types of cars, which makes sense given that they are small.

  • Vans have a relatively high fuel tank capacity and low MPG compared to other types of cars.

  • There seems to be a negative correlation between fuel tank capacity and MPG in the city.

  • Large cars tend to have smaller maximum RPMs.
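The visual impressions above can also be checked numerically. The following is a minimal sketch, assuming the MASS package (which provides Cars93) is installed; it is a sanity check on the takeaways rather than part of the required answer.

```r
library(MASS)  # provides the Cars93 data set

# Correlation between fuel tank capacity and city MPG;
# a negative value is consistent with the downward trend in the plot.
tank_mpg_cor <- cor(Cars93$Fuel.tank.capacity, Cars93$MPG.city)
print(tank_mpg_cor)

# Average fuel tank capacity, city MPG, and RPM at maximum horsepower by car type.
type_means <- aggregate(cbind(Fuel.tank.capacity, MPG.city, RPM) ~ Type,
                        data = Cars93, FUN = mean)
print(type_means)
```

Comparing the rows of type_means is one way to back up statements about a particular type of car with numbers.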



Problem 2

(3 points each)

More Text Annotations on Graphs

Load the Olive Oil data from Lab 08.

olive <- fread("https://raw.githubusercontent.com/sventura/315-code-and-datasets/master/data/olive_oil.csv")

Part (a)

Create a bar chart of the region variable. Use the geom_text() function to add the counts in each bar above each bar in the bar chart. E.g., ggplot(...) + ... + geom_text(stat = "count", aes(y = ..count.., label = ..count..)). Adjust the vjust parameter in geom_text() in order to place the numbers in a more appropriate place. Use a single non-default color for the bars.

region.labels <- c("Northern Apulia", "Southern Apulia", "Calabria", "Sicily",
                   "Inland Sardinia", "Coastal Sardinia", "Eastern Liguria",
                   "Western Liguria", "Umbria")
area.labels <- c("South", "Sardinia", "Centre-North")

olive <- olive %>% mutate(region = as.factor(region), area = as.factor(area)) 
levels(olive$region) <- region.labels
levels(olive$area) <- area.labels
olive %>%
  ggplot(aes(x = region)) + 
  geom_bar(fill = "#cc002b") +
  ggtitle("Number of Olive Oils per Region") +
  labs(x = "Region", y = "Count") + 
  geom_text(stat = "count", aes(y = ..count.., label = ..count..), vjust = -0.4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Part (b)

Create a bar chart of the area variable. Use the geom_text() function to add the counts in each bar in the middle of each bar (i.e., halfway up each bar) in the bar chart. Use a larger-than-default font size for the text. Use a single non-default color for the bars, and change the color of the text so that it contrasts well with the color of the bars.

olive %>%
  ggplot(aes(x = area)) + 
  geom_bar(fill = "#cc002b") +
  ggtitle("Number of Olive Oils per Area") +
  labs(x = "Area", y = "Count") + 
  geom_text(stat = "count",
            aes(y = ..count.. / 2, label = ..count..),
            size = 10, colour = "#4d565b") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Part (c)

Repeat part (b), but use label = scales::percent((..count..)/sum(..count..)) to put percentages on the plot instead of the raw counts. This is nice, since it allows you to quickly see both the percentages and the counts (height of the bars) in each category.

olive %>%
  ggplot(aes(x = area)) + 
  geom_bar(fill = "#cc002b") +
  ggtitle("Number of Olive Oils per Area") +
  labs(x = "Area", y = "Count") + 
  geom_text(stat = "count",
            aes(y = ..count.. / 2, 
                label = scales::percent((..count..)/sum(..count..))),
            size = 10, colour = "#4d565b") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



Problem 3

(3 points each; 21 points total)

2D-KDEs with Contour Plots and Adjusted Bandwidths

Part A

olive_cont = subset(olive, select = -c(1,2))
olive_cont_scale <- scale(olive_cont)
dist_olive <- dist(olive_cont_scale)
olive_mds <- cmdscale(dist_olive, k = 2)
olive_mds <- as.data.frame(olive_mds)
colnames(olive_mds) <- c("mds_coordinate_1", "mds_coordinate_2")

olive_mds$area <- olive$area
olive_mds$region <- olive$region

region.labels.mds <- 1:9
levels(olive_mds$region) <- region.labels.mds

Part B

ggplot(data = olive_mds) + 
  geom_density2d(aes(x = mds_coordinate_1, y = mds_coordinate_2)) +
  ggtitle("Contour Plot:  2-D Multidimensional Scaling of Olive Dataset") + 
  xlab("MDS Coordinate 1") + ylab("MDS Coordinate 2") + 
  my_theme

Part C

ggplot(data = olive_mds, aes(x = mds_coordinate_1, y = mds_coordinate_2)) + 
  geom_density2d() +
  geom_point(aes(color = area, shape = region), size = 2) +
  scale_shape_manual(values = as.character(sort(unique(olive_mds$region)))) +
  guides(color=guide_legend(title="Area"), shape=guide_legend(title="Region")) +
  ggtitle("Contour Plot:  2-D Multidimensional Scaling of \n Olive Dataset By Area and Region") + 
  xlab("MDS Coordinate 1") + ylab("MDS Coordinate 2") + 
  my_theme

Part D

ggplot(data = olive_mds, aes(x = mds_coordinate_1, y = mds_coordinate_2)) + 
  geom_density2d(h = c(1,1)) +
  geom_point(aes(color = area, shape = region), size = 2) +
  scale_shape_manual(values = as.character(sort(unique(olive_mds$region)))) +
  guides(color=guide_legend(title="Area"), shape=guide_legend(title="Region")) +
  ggtitle("Contour Plot:  2-D Multidimensional Scaling of \n Olive Dataset By Area and Region") + 
  xlab("MDS Coordinate 1") + ylab("MDS Coordinate 2") + 
  my_theme

Part E

ggplot(data = olive_mds, aes(x = mds_coordinate_1, y = mds_coordinate_2)) + 
  geom_density2d(h = c(5,5)) +
  geom_point(aes(color = area, shape = region), size = 2) +
  scale_shape_manual(values = as.character(sort(unique(olive_mds$region)))) +
  guides(color=guide_legend(title="Area"), shape=guide_legend(title="Region")) +
  ggtitle("Contour Plot:  2-D Multidimensional Scaling of \n Olive Dataset By Area and Region") + 
  xlab("MDS Coordinate 1") + ylab("MDS Coordinate 2") + 
  my_theme

Part F

I prefer the smaller bandwidth (or even the default), because the larger bandwidth returns an extremely smooth density estimate that doesn’t provide any useful information about the underlying empirical distribution in this problem. That is, the density is highest in an area where the points are a bit sparse, and not as high where the points are more closely clustered. In general, the peaks of the density estimates should correspond to the actual locations of the points (ideally in areas where there are lots of points), and the valleys in the density estimate should correspond to areas of the feature space where there are few observations. The large bandwidth prevents this from happening.

If we are just trying to represent the density of this data, then a smaller bandwidth is better, because it gives more detail that we can use to interpret the data. But if we are trying to apply our findings to future olive oil data, we wouldn’t want our bandwidth to be too small, because we may overfit to our particular data. (If you haven’t learned about overfitting yet, no big deal! You’ll learn this in 36-401 and 36-402.)
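The effect of the bandwidth can be illustrated outside of ggplot with MASS::kde2d(), the 2-D kernel density estimator in the MASS package. This is a minimal sketch on simulated data; the two clusters, the grid limits, and the bandwidth values are all hypothetical choices made for illustration, not values taken from the olive data.

```r
library(MASS)  # provides kde2d()

set.seed(315)
# Hypothetical data: two well-separated clusters in two dimensions.
x <- c(rnorm(200, mean = -3), rnorm(200, mean = 3))
y <- c(rnorm(200, mean = -3), rnorm(200, mean = 3))

# Same data and same evaluation grid, two bandwidths (one value per direction).
grid_limits <- c(-7, 7, -7, 7)
kde_small <- kde2d(x, y, h = c(1, 1), n = 51, lims = grid_limits)
kde_large <- kde2d(x, y, h = c(20, 20), n = 51, lims = grid_limits)

# With 51 grid points on [-7, 7], index 26 is the point (0, 0),
# which lies midway between the two clusters.
mid_index <- 26
valley_ratio <- function(kde) kde$z[mid_index, mid_index] / max(kde$z)

# A ratio near 0 means a clear valley between the clusters;
# a ratio near 1 means the valley has been smoothed away.
print(c(small = valley_ratio(kde_small), large = valley_ratio(kde_large)))
```

With the large bandwidth, the estimate is nearly as high between the clusters, where there are almost no points, as it is anywhere else. That is the behavior described above.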

Part G

Repeat part (e), but this time, adjust the bandwidths in each direction as you see fit. In particular, choose a different, non-default bandwidth for each direction. Comment on why you chose these bandwidths.
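One possible way to pick a different bandwidth for each direction is to start from a rule-of-thumb value per coordinate and then scale it. The sketch below uses MASS::bandwidth.nrd(), which is what kde2d() uses for its default bandwidths. The simulated coordinates and the 0.75 scaling factor are hypothetical and only illustrate the mechanics.

```r
library(MASS)  # provides bandwidth.nrd() and kde2d()

set.seed(315)
# Hypothetical stand-in for two coordinates with unequal spread.
coord_1 <- rnorm(300, sd = 3)
coord_2 <- rnorm(300, sd = 1)

# Rule-of-thumb bandwidth in each direction.
h_default <- c(bandwidth.nrd(coord_1), bandwidth.nrd(coord_2))

# One possible adjustment: shrink both for a slightly less smooth estimate.
h_chosen <- 0.75 * h_default
print(round(h_chosen, 2))

# The chosen bandwidths can be passed directly to the estimator.
kde <- kde2d(coord_1, coord_2, h = h_chosen, n = 50)
```

In a ggplot, the same vector would be supplied as geom_density2d(h = h_chosen). The direction with the larger spread ends up with the larger bandwidth, which is one reasonable justification to give in the written comment.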



Problem 4

(10 points)

2D-KDEs with Heat Maps and Three-Color Gradients

Part A

ggplot(data = olive_mds, aes(x = mds_coordinate_1, y = mds_coordinate_2)) + 
  stat_density2d(aes(fill = ..density..), geom = "tile", contour = F, h = c(2,2)) + 
  scale_fill_gradient2("Density", low = "white", mid = "yellow", high = "red", midpoint = 0.04) +
  ggtitle("Heat Map: 2-D Multidimensional Scaling of Olive Dataset") + 
  xlab("MDS Coordinate 1") + ylab("MDS Coordinate 2") + 
  my_theme

Part B

I chose these colors because there is a natural progression between white, yellow, and red. Also, since yellow and red are primary colors, orange is naturally included in this gradient as well. Additionally, understanding the colors on the plot is pretty intuitive, since white is commonly used to represent the absence of observations in graphs, and red is commonly used to represent intensity.

I chose the midpoint parameter to be 0.04 because, at this value, the gradient best represents the range of the density, and the midpoint corresponds to the mid color (yellow). If you increase the midpoint parameter, the gradient of the density scale moves towards the color set as low (in this case, white), such that the scale almost becomes one color. If you decrease the midpoint parameter, the gradient of the density scale moves towards the color set as high (red).
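To see why moving the midpoint shifts the colors, it helps to look at where each density value lands on a 0-to-1 color position, with 0 the low color, 0.5 the mid color, and 1 the high color. The function below is a hand-rolled illustration of a midpoint-centered rescaling, not ggplot2's own code, and the density values are hypothetical.

```r
# Map values to [0, 1] so that `midpoint` always lands on 0.5,
# the position of the middle color in a three-color gradient.
rescale_about_midpoint <- function(x, midpoint, from = range(x)) {
  extent <- 2 * max(abs(from - midpoint))
  (x - midpoint) / extent + 0.5
}

density_values <- c(0, 0.04, 0.08)  # hypothetical density range

centered <- rescale_about_midpoint(density_values, midpoint = 0.04)
larger   <- rescale_about_midpoint(density_values, midpoint = 0.07)
smaller  <- rescale_about_midpoint(density_values, midpoint = 0.01)

print(round(rbind(centered, larger, smaller), 3))
```

With the centered midpoint, the values span the full range from 0 to 1. With the larger midpoint they only reach about 0.57, so the plot stays in the low-to-mid half of the gradient. With the smaller midpoint they start at about 0.43, so the plot stays in the mid-to-high half.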

Part C

I prefer the contour plot over the heat map because it provides more details that we can use to interpret the data. The heat map is a more general display of areas of higher and lower density, but lacks extensive detail about the underlying empirical distribution. (Note: There’s no “right” answer here, in the sense that a contour plot isn’t always better than a heat map. They just represent the data in different ways.)

(Of course, you can always layer everything onto one plot – a heat map, a contour plot, the points themselves, a regression line, etc – if you think this adds value to the plot.)



Problem 5

Hierarchical Clustering and Dendrograms

There are several ways to create dendrograms in R. Regardless of which dendrogram package you use, you’ll first need to create the distance matrix corresponding to your dataset, and submit that distance matrix to hierarchical clustering.
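As a minimal sketch of that workflow, the example below runs the same steps on a small simulated dataset using only base R. The toy data frame and its column names are hypothetical.

```r
set.seed(315)
# Hypothetical data: 10 observations on 3 continuous variables with different scales.
toy <- data.frame(a = rnorm(10, sd = 1),
                  b = rnorm(10, sd = 10),
                  c = rnorm(10, sd = 100))

toy_scaled <- scale(toy)        # center and scale each column
toy_dist <- dist(toy_scaled)    # pairwise Euclidean distances (a "dist" object)
toy_hc <- hclust(toy_dist, method = "complete")

print(names(toy_hc))            # components of the fitted object
print(max(toy_hc$height))       # height of the final merge
```

Scaling first keeps the variable with the largest units from dominating the distances. With complete linkage, the height of the final merge equals the largest pairwise distance in the data.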

Part (a)

olive_cont <- olive %>% dplyr::select(-area, -region)
olive_cont_scale <- olive_cont %>% scale()
dist_olive <-  olive_cont_scale %>% dist()

Part (b)

hc_olive_complete <- hclust(dist_olive, method = "complete")

The object hc_olive_complete contains the following components: merge, height, order, labels, method, call, and dist.method. hc_olive_complete$method contains the linkage method that was used to construct the hierarchical clusters; in this case, it is “complete”.

Part (c)

plot(hc_olive_complete, xlab = "Olive Oil Specimen", ylab = "Cluster Merge Distance",
     sub = "", main = "Dendrogram of Hierarchical Cluster of\n Olive Oil using Complete Linkage")

plot() uses a dendrogram to visualize the hierarchical clustering results.

Part (d)

The maximum distance at which two observations are linked can be seen in the dendrogram as the highest horizontal bar. This occurs at a height of roughly 11.

Part (e)

Of the two groups linked in the final iteration of hierarchical clustering, the larger cluster contains roughly two-thirds of the observations and the smaller cluster contains roughly one-third of the observations.

Part (f)

labels_complete_2 <- cutree(hc_olive_complete, k = 2)

labels_complete_2 is a vector of integers containing 572 elements. Note that this is the same as the number of olive oil specimens contained in the original dataset.

Part (g)

The first cluster contains 32.3% of the olive oil specimens, while the second contains 67.7% of the olive oil specimens.
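Percentages like these can be computed directly from the vector returned by cutree(). The following is a minimal sketch on simulated data; the two groups of 30 and 60 points are hypothetical and chosen so that the clusters are easy to separate.

```r
set.seed(315)
# Hypothetical data: 30 points around (0, 0) and 60 points around (20, 20).
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(120, mean = 20), ncol = 2))
toy_hc <- hclust(dist(toy), method = "complete")

# One integer cluster label per observation.
toy_labels <- cutree(toy_hc, k = 2)

# Share of observations in each cluster, as percentages.
cluster_share <- prop.table(table(toy_labels))
print(round(100 * cluster_share, 1))
```

Applying the same two lines to labels_complete_2 would give the shares for the olive oil clusters.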

Part (h)

labels_complete_3 <- cutree(hc_olive_complete, k = 3)
olive_mds <- as.data.frame(cmdscale(dist_olive, k = 2))
names(olive_mds) <- c("First MDS Coordinate", "Second MDS Coordinate")
olive_mds <- olive_mds %>% mutate(area = as.factor(olive$area),
                                  region = as.factor(olive$region),
                                  cluster = as.factor(labels_complete_3))
area.labels <- c("South", "Sardinia", "Centre-North")
levels(olive_mds$area) <- area.labels

ggplot(olive_mds,
       aes(x = `First MDS Coordinate`, y = `Second MDS Coordinate`,
           color = area, label = cluster)) +
  geom_text() + ggtitle("Scatterplot of First and Second MDS Coordinates \nof Olive Oil Data Set\nby Area and Assigned Cluster") +
  guides(color = guide_legend(title = "Area")) + my_theme

There is some correspondence between the area variable and the cluster labels but it is imperfect. The general mapping is as follows:

  • Centre-North - Cluster 1 or Cluster 2
  • South - Cluster 2
  • Sardinia - Cluster 1 or Cluster 3
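A cross-tabulation is one way to check a mapping like this without relying only on the plot. The sketch below uses a handful of hypothetical labels to show the mechanics; it does not reproduce the olive oil results.

```r
# Hypothetical labels: a known grouping and cluster assignments for 8 observations.
area <- factor(c("South", "South", "South", "Sardinia", "Sardinia",
                 "Centre-North", "Centre-North", "Centre-North"))
cluster <- c(2, 2, 2, 1, 3, 1, 2, 1)

# Rows are areas, columns are clusters; each cell counts the matching observations.
area_by_cluster <- table(area, cluster)
print(area_by_cluster)

# Row proportions show how each area is split across the clusters.
print(round(prop.table(area_by_cluster, margin = 1), 2))
```

For the assignment data, the analogous call would be table(olive_mds$area, olive_mds$cluster). An area whose row is concentrated in a single column corresponds cleanly to one cluster.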

Part (i)

library(dendextend)
colorblind_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", 
                        "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
get_colors <- function(x, palette = colorblind_palette) palette[match(x, unique(x))]

dend <- as.dendrogram(hc_olive_complete)
dend %>% 
  set("labels", olive$region, order_value = T) %>% 
  set("leaves_col", get_colors(olive_mds$cluster), order_value = T) %>% 
  ggplot(theme = my_theme) + 
  ggtitle("Dendrogram of Olive Oil Data") + 
  labs(y = "Pairwise Euclidean Distance") + 
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())

Part (j)

dend %>% 
  set("labels", olive$region, order_value = T) %>% 
  set("leaves_col", get_colors(olive_mds$cluster), order_value = T) %>% 
  set("labels_cex", 0.2) %>%
  #set("branches_k_color", c("#999999", "#E69F00", "#56B4E9"), k = 3) %>% #color branches
  ggplot(theme = my_theme) + 
  ggtitle("Dendrogram of Olive Oil Data") + 
  labs(y = "Pairwise Euclidean Distance") + 
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())



Problem 6

(2 points each)

a.

Things you may have done well:

  • You remembered to write your name on the exam.

  • You turned it in on time.

  • You used complete sentences, proper grammar, and correct spelling.

  • Your graphs were well-labeled and easy to interpret.

  • You didn’t say something was normally distributed when it wasn’t.

  • Your graphical interpretations were pretty good in general.

  • You remembered to fill in the missing months in 6.

  • You relabeled genders in problem 4.

  • You managed to get all the bonus points.

b.

Things you may have done wrong:

  • You didn’t go/turn it in.

  • You forgot to use the new data set.

  • You cheated and got away with it, you slimy little worm.

  • Your labels were not informative enough.

  • Your graphs didn’t help answer some parts of the questions.

  • You forgot to answer some of the questions.

  • Your chosen graphical parameters were suboptimal.

  • You didn’t talk about how you changed the data if you did so.

  • Your graphical interpretations didn’t match what the graph said.

  • You didn’t provide enough evidence for your claims.

  • You just generally made Sam and the TAs very disappointed.



Problem 7

(2 points each)

a.

Sample Good Qualities

  • Many variables plotted at once.

  • Easy to compare players on a single variable.

  • Easy to see extreme values.

  • Generally well-labeled (though a reader who doesn’t know soccer well might not know what the variables mean without further explanation).

  • Not much wasted data ink.

  • Counts are scaled to be per 90 minutes of playing time.

b.

Sample Bad Qualities

  • Difficult to discern when the colors overlap.

  • Range of the axes changes for each variable.

  • Can’t compare to league average; no sense of how these axis ranges compare to all other players.

  • The graph makes it look like we can compare across variables, even though we can’t (e.g., what does it mean that Ronaldo’s shots have a larger radius than Messi’s pass percentage?)

  • Could use a better title (pretty useless for those who don’t know who Lionel Messi, Cristiano Ronaldo, and Luis Suarez are.)

  • Potential distortion (part c.)

c.

There can be distortion in radar charts. When the variables in the data are on different scales, the filled area of the polygon created by the axes depends on the order in which the variables are arranged around the radar chart.
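A short calculation makes the ordering effect concrete. The sketch below computes the area of a radar-chart polygon as the sum of the triangles between adjacent axes, assuming equally spaced axes; the four values are hypothetical.

```r
# Area of a radar-chart polygon whose vertices sit at radii `r`
# on equally spaced axes (sum of the triangles between adjacent axes).
radar_area <- function(r) {
  n <- length(r)
  r_next <- c(r[-1], r[1])  # radius on the neighbouring axis (wraps around)
  0.5 * sin(2 * pi / n) * sum(r * r_next)
}

# The same four hypothetical values in two different axis orders.
alternating <- c(1, 5, 1, 5)
grouped <- c(1, 1, 5, 5)

print(radar_area(alternating))  # 10
print(radar_area(grouped))      # 18
```

The data are identical in both cases, yet the filled area nearly doubles when the large values sit on adjacent axes.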



BONUS

(BONUS: 10 points)

Beyond Default Pairs Plots

Create a subset of the olive dataset. Include the area and region variables in your subset, and pick three continuous variables to include in your subset (your choice). Call this subset olive_sub. In your subset, overwrite the area and region variables so that they are treated as factor variables.

Create a pairs plot of your subset. Within the ggpairs() function, manually specify the upper triangle of plots to change the type of information shown.

It may help to use the following code as an outline of your solution:

region.labels <- c("Northern Apulia", "Southern Apulia", "Calabria", "Sicily",
                   "Inland Sardinia", "Coastal Sardinia", "Eastern Liguria",
                   "Western Liguria", "Umbria")
area.labels <- c("South", "Sardinia", "Centre-North")

olive_sub <- olive %>%
  dplyr::select(area, region, palmitic, palmitoleic, stearic, oleic) %>%
  mutate(region = as.factor(region), area = as.factor(area)) 
olive_sub$area <- as.factor(olive_sub$area)
olive_sub$region <- as.factor(olive_sub$region)

olive_sub %>%
  GGally::ggpairs(upper = list(continuous = "density",
                               discrete = "ratio",
                               combo = "facetdensity"),
                  title = "Exploring Discrete and Continuous Variables in Olive Data"
  )

In the upper triangle:

  • Use a 2-D density estimate for pairs of continuous variables
  • Show the ratio for pairs of discrete variables
  • Use a faceted density estimate for a combination of a discrete and a continuous variable

See the examples in the help documentation for how to do this.